
Refresh ephemeral information before cluster operations in VTOrc#10115

Merged
GuptaManan100 merged 12 commits into vitessio:main from planetscale:vtorc-prs
Apr 22, 2022

Conversation

Contributor

@GuptaManan100 GuptaManan100 commented Apr 20, 2022

Description

This PR adds functionality to VTOrc to refresh its ephemeral information before running any cluster operation, since another agent could have run a cluster operation in the meantime. To ensure we are not making decisions on stale data, after locking the shard we refresh the information for all the tablets, verify that the problem is still present, and only then go ahead and fix it.
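The lock-refresh-verify flow described above can be sketched roughly as follows. Note that `shard`, `refresh`, and `recoverShard` are hypothetical stand-ins for VTOrc's real topo locking and tablet-refresh code, not the actual Vitess API:

```go
package main

import (
	"fmt"
	"sync"
)

// shard is an illustrative stand-in for a shard whose state VTOrc watches.
type shard struct {
	mu             sync.Mutex
	hasDeadPrimary bool
}

// refresh would re-read ephemeral tablet records from the topo server; it is
// a no-op here because the in-memory state is already current.
func (s *shard) refresh() {}

// recoverShard locks the shard, refreshes its view, and repairs only if the
// problem is still present after the refresh.
func recoverShard(s *shard) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.refresh()
	if !s.hasDeadPrimary {
		return false // another agent already fixed it; do nothing
	}
	s.hasDeadPrimary = false // stand-in for running the actual repair (e.g. ERS)
	return true
}

func main() {
	s := &shard{hasDeadPrimary: true}
	fmt.Println(recoverShard(s)) // true: this caller performs the repair
	fmt.Println(recoverShard(s)) // false: the refreshed view shows nothing to do
}
```

The second call illustrates the point of the PR: a caller that arrives after the problem has been fixed re-checks under the lock and becomes a no-op instead of issuing a redundant operation.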

After this change, running PlannedReparentShard from vtctld will no longer cause VTOrc to go bonkers and run a reparent operation of its own. It will notice that something has changed, but after refreshing its information it will see that everything is okay and will not do anything unnecessary.

The way the code is structured now, it was very hard to add a unit test, so for now I have added end-to-end tests for this as well. This code can be refactored to be much simpler, which will also allow us to add strong unit tests.

Up until now, VTOrc has not had the capability to refresh only some tablets' or shards' information. The current refresh reloads all tablets' information, which is unnecessary, especially while we are holding the lock. This is marked as a TODO in this PR and will be fixed shortly; that change is not included here to keep the PR easy to review.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required: Yes
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management release notes labels Apr 20, 2022
@GuptaManan100 GuptaManan100 changed the title Vtorc prs Refresh ephemeral information before cluster operations in VTOrc Apr 20, 2022
@GuptaManan100 GuptaManan100 marked this pull request as ready for review April 20, 2022 14:56
Comment on lines -658 to -663
if checkAndSetIfERSInProgress() {
	AuditTopologyRecovery(topologyRecovery, "an ERS is already in progress, not issuing another")
	return false, topologyRecovery, nil
}
defer setERSCompleted()

Collaborator


Is it safe to remove this?

Contributor Author

@GuptaManan100 GuptaManan100 Apr 21, 2022


Yes, it is safe to remove now. Let me start with why it was introduced in the first place. I introduced this logic when I initially changed checkAndRecoverDeadPrimary to use EmergencyReparentShard in #8492.
Before that PR, we used to lock the shard as part of the ERS code, i.e. after we had already called ERS. Because the cluster failure checks happen on timer ticks and repair operations run in separate goroutines, we sometimes had situations where two ticks of the failure check spawned two different repair threads, each of which tried to run an ERS.

The LockShard functionality is blocking and does not return an error if the shard is already locked. This meant that if there were two calls to ERS, one would get the lock and the other would wait for it. After the first finished its execution, we had another run of ERS. So we had two runs of ERS for a single failure, which was not correct. The solution we decided on then was to add this code: a mutex lock and a variable tracking whether an ERS was in progress from this same VTOrc instance. This prevented us from spawning another ERS call while one was already running.
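A rough reconstruction of the mutex-plus-flag guard described above (the actual Vitess implementation may differ; the function names match the removed diff hunk, but the bodies here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Process-local state: a mutex protecting a flag that records whether an ERS
// is currently running from this VTOrc instance.
var (
	ersMu         sync.Mutex
	ersInProgress bool
)

// checkAndSetIfERSInProgress reports whether an ERS is already in progress
// and, if not, atomically marks one as started.
func checkAndSetIfERSInProgress() bool {
	ersMu.Lock()
	defer ersMu.Unlock()
	if ersInProgress {
		return true // another ERS is running; caller should bail out
	}
	ersInProgress = true
	return false
}

// setERSCompleted clears the flag once the ERS has finished.
func setERSCompleted() {
	ersMu.Lock()
	defer ersMu.Unlock()
	ersInProgress = false
}

func main() {
	fmt.Println(checkAndSetIfERSInProgress()) // false: first caller proceeds
	fmt.Println(checkAndSetIfERSInProgress()) // true: concurrent caller bails out
	setERSCompleted()
	fmt.Println(checkAndSetIfERSInProgress()) // false: a new ERS may start
}
```

This guard only deduplicates work within a single process, which is why it could not help with federation across multiple VTOrc instances.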

This was a temporary workaround until we addressed federation, as we have now. Since we lock the shard much earlier and check again whether the fix is still required after refreshing the information, we no longer need this type of locking. Even if there is a second call to checkAndRecoverDeadPrimary, it would block on acquiring the lock at the beginning, and once it acquires it, it will reload the information and then decide whether to run ERS or not. So, if the first call to ERS fixed everything, there is nothing left to do and we exit.

So, the change in this PR (along with one more to follow, described below) is actually powerful enough to address three concerns:

  1. Redundant operations from the same VTOrc instance (Like we see above)
  2. Redundant operations from VTOrc when a cluster operation is run from vtctld (Like PRS, which I have added a test for now)
  3. Federation between different VTOrc instances. With this change, it should be safe to run multiple VTOrc instances without them stepping on each other's toes, because this change guarantees that we only run a cluster operation when it is needed.

For the sake of completeness, I will also add that this reload of information has been added to 3 of the 5 functions we use for fixing things. The two I have not yet added it to are fixPrimary and fixReplica, which only fix the replication parameters on the primary and replica respectively. I intend to make that change as well, but I didn't want to add any code that didn't have an associated end-to-end test. So I'll make this change in the next PR, which first refactors the way we get the replication analysis and run the operations; that will also enable us to add unit tests for this change.

@GuptaManan100 GuptaManan100 requested a review from deepthi April 22, 2022 00:31